[Mlas] optimize MlasConv using thread partition opt #25255
zoeczy wants to merge 3 commits into microsoft:main
Conversation
Pull Request Overview
This PR introduces a thread partitioning optimization for MlasConv, distributing convolution work across the batch and group dimensions to improve multi-threading performance in scenarios with small batch sizes or grouped convolutions.
Key changes:
- Implements a new threaded execution path for MlasConvAlgorithmExpandThenGemmSegmented that partitions work by batch-group pairs (see the sketch after this list)
- Adds dynamic working buffer size calculation based on batch count and group count
- Includes a new threaded benchmark function to evaluate multi-threaded performance improvements
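To make the batch-group partitioning concrete, here is a minimal, self-contained sketch of the idea. This is not the actual MLAS code: `ConvWorkItem` and `RunThreaded` are hypothetical names, and the real implementation dispatches through the MLAS threadpool rather than raw `std::thread`.

```cpp
// Illustrative sketch only: partition the convolution across the flattened
// (batch, group) index space so each thread owns a contiguous slice.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct ConvWorkItem {
    std::size_t BatchIndex;
    std::size_t GroupIndex;
};

void RunThreaded(std::size_t BatchCount, std::size_t GroupCount,
                 std::size_t ThreadCount,
                 const std::function<void(const ConvWorkItem&)>& Kernel) {
    const std::size_t BatchGroupCount = BatchCount * GroupCount;
    if (BatchGroupCount == 0) return;
    // Never spawn more threads than there are (batch, group) work items.
    const std::size_t Threads = std::min(ThreadCount, BatchGroupCount);
    const std::size_t PerThread = (BatchGroupCount + Threads - 1) / Threads;
    std::vector<std::thread> pool;
    for (std::size_t tid = 0; tid < Threads; ++tid) {
        pool.emplace_back([=, &Kernel] {
            const std::size_t Begin = tid * PerThread;
            const std::size_t End = std::min(Begin + PerThread, BatchGroupCount);
            for (std::size_t bg = Begin; bg < End; ++bg) {
                // Decode the flat index back into (batch, group) coordinates.
                Kernel(ConvWorkItem{bg / GroupCount, bg % GroupCount});
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

In this shape, each work item corresponds to one convolution over a single batch-group pair, which is what lets small-batch and grouped convolutions occupy multiple cores.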
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds threaded benchmark function with 4-thread threadpool for performance testing |
| onnxruntime/core/mlas/lib/convolve.cpp | Implements batch-group partitioning logic and optimized working buffer allocation |
Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/convolve.cpp:1379
Variable name 'WorkingBufferSizePreThread' should be 'WorkingBufferSizePerThread' (fix typo: 'Pre' -> 'Per').

    size_t WorkingBufferSizePreThread = std::max(Parameters->OutputSize * Parameters->K,
onnxruntime/core/mlas/lib/convolve.cpp

    *WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

    if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Inconsistent indentation: this line uses a tab character while the surrounding code uses spaces. It should be indented with spaces to match the existing code style.
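For context on the line being reviewed, here is a hedged sketch of the per-thread buffer sizing it implements in spirit; all names below are illustrative stand-ins, not the actual MLAS symbols from MLAS_CONV_PARAMETERS.

```cpp
// Illustrative sketch only: size the shared working buffer so every thread
// gets a private slice for its im2col expansion.
#include <cstddef>

struct ConvShape {
    std::size_t OutputSize;  // spatial output elements per (batch, group) pair
    std::size_t K;           // im2col rows: input channels * kernel size
};

std::size_t PerThreadBufferElements(const ConvShape& shape) {
    // Room for one expanded input tile per thread.
    return shape.OutputSize * shape.K;
}

std::size_t TotalWorkingBufferElements(const ConvShape& shape,
                                       std::size_t threadCount) {
    // Private slices mean no synchronization is needed while threads
    // expand their tiles in parallel.
    return threadCount * PerThreadBufferElements(shape);
}
```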
onnxruntime/core/mlas/lib/convolve.cpp

    *WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

    if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Missing space after 'if' and around operators. Should be formatted as 'if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {' to match C++ style conventions.

Suggested change:
    -    if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
    +    if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {
onnxruntime/core/mlas/lib/convolve.cpp

    const size_t OutputGroupSize = FilterCount * OutputSize;
    const size_t FilterGroupSize = FilterCount * K;

    // std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;

Debug output statement should be removed from production code. This commented-out debug line should be deleted.

Suggested change:
    -    // std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;
onnxruntime/core/mlas/lib/convolve.cpp

    const size_t BatchGroupCount = BatchCount * GroupCount;

    int32_t TargetThreadCount = MlasGetMaximumThreadCount(ThreadPool);
    // TargetThreadCount = 16;

Commented-out hardcoded thread count should be removed from production code. This appears to be leftover debugging code.

Suggested change:
    -    // TargetThreadCount = 16;
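As context for this reviewed code, a hedged sketch of how a target thread count can be clamped to the available batch-group work items; `ChooseTargetThreadCount` is a hypothetical helper, not the MLAS code.

```cpp
// Illustrative sketch only: clamp the target thread count to the number of
// (batch, group) work items so no thread is created without work.
#include <algorithm>
#include <cstddef>
#include <cstdint>

std::int32_t ChooseTargetThreadCount(std::int32_t maxThreads,
                                     std::size_t batchCount,
                                     std::size_t groupCount) {
    const std::size_t batchGroupCount = batchCount * groupCount;
    const std::size_t clamped =
        std::min(static_cast<std::size_t>(maxThreads), batchGroupCount);
    return static_cast<std::int32_t>(clamped);
}
```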
|
Could you please address Copilot's review comments?
…tion opt (#26103)

### Description
This is an internal branch dupe of #25255 + some minor cosmetic changes to account for Copilot feedback

### Motivation and Context
Improve performance of NCHW Conv - both grouped convolutions and batched inputs should benefit from this change. For a detailed understanding of the perf improvement, please refer to the numbers in #25255.

Credit to @zoeczy and team for this improvement and code change

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Closing as its internal dupe has been merged - thanks for the contribution!
Adds the following commits to the release-1.23.2 branch for ORT 1.23.2:
- [TensorRT] Fix DDS output bug during engine update - PR: #26272 - commit id: 00e85dd
- Fix shape inference failure with in-memory external data - PR: #26263 - commit id: d955476
- [CUDA] replace 90a-virtual by 90-virtual for forward compatible - PR: #26230 - commit id: b58911f
- [QNN-EP] Fix logic flow bug - PR: #26148 - commit id: b282379
- Internal Dupe of #25255 - [MLAS] Optimize MlasConv using thread partition opt - PR: #26103 - commit id: 7362518
- Update qMoE spec to support block quantization - PR: #25641 - commit id: 7a8ffa8
- [VitisAI] add new api to VitisAI to save graph as a string - PR: #25602 - commit id: 3361d72
- [Build] Lock torch, onnxscript and onnx-ir versions to latest - PR: #26315 - commit id: ea69c4d

---------

Co-authored-by: Hariharan Seshadri <shariharan91@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yateng Hong <toothache9010@gmail.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: quic-calvnguy <quic_calvnguy@quicinc.com>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: yifei410 <31260809+yifei410@users.noreply.github.com>
Co-authored-by: yifei <y.zhou@xilinx.com>
Description
This PR enhances MlasConv in ONNX Runtime by introducing a thread partitioning strategy based on Batch Size (bs) and Group Count (group). This change improves multi-threading efficiency in convolution scenarios where scaling with core/thread count was previously limited.

The PR also includes updates to the bench_sconv utility to support and evaluate multi-threaded performance under the new partitioning strategy, run via:

    numactl -C core_num0-core_num_1 ./onnxruntime_mlas_benchmark --benchmark_filter=Teams

Compared to the current master implementation, the optimized version exhibits nearly 3× performance improvement, showing effective scaling with thread count. In contrast, the master branch shows no meaningful performance gain when increasing the number of threads, due to insufficient parallelization in the original implementation. A rough sketch of the threaded benchmark shape follows.
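The sketch below illustrates only the shape of a threaded benchmark, loosely modeled on the PR's bench_sconv changes; it is not the actual file. `RunConvSlice` is a hypothetical placeholder for the real MlasConv invocation, and only the 4-thread dispatch is shown.

```cpp
// Hedged sketch of a threaded Google Benchmark case with a fixed 4-thread pool.
#include <benchmark/benchmark.h>
#include <thread>
#include <vector>

static void RunConvSlice(int tid, int threads) {
    // Placeholder: a real benchmark would run the convolution kernel over
    // this thread's share of the batch-group pairs.
    benchmark::DoNotOptimize(tid + threads);
}

static void BM_SconvThreaded(benchmark::State& state) {
    const int threads = 4;  // matches the 4-thread threadpool described in the PR
    for (auto _ : state) {
        std::vector<std::thread> pool;
        for (int tid = 0; tid < threads; ++tid) {
            pool.emplace_back(RunConvSlice, tid, threads);
        }
        for (auto& t : pool) {
            t.join();
        }
    }
}
BENCHMARK(BM_SconvThreaded);
BENCHMARK_MAIN();
```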
Motivation and Context
MlasConv exhibited minimal performance gains when increasing the number of threads or CPU cores in scenarios with small batch sizes or grouped convolutions. Benchmark results from bench_sconv show a noticeable performance improvement in multi-threaded runs, especially on multi-core CPUs.

Related Issues
#25152